ARIMA models can also be used for seasonal data. We write a seasonal model as \(ARIMA(p,d,q)(P,D,Q)_m\), where m is the number of observations per year. The seasonal part of the model consists of terms similar to the non-seasonal components, but they involve backshifts of the seasonal period.
An \(ARIMA(1,1,1)(1,1,1)_4\) model without a constant can be written as \[(1-\phi_1 B)(1-\Phi_1 B^4)(1-B)(1-B^4) y_t = (1+\theta_1 B)(1+\Theta_1 B^4)e_t\]
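Multiplying out the two differencing factors shows how the ordinary and seasonal differences combine:
\[(1-B)(1-B^4)y_t = (1 - B - B^4 + B^5)y_t = y_t - y_{t-1} - y_{t-4} + y_{t-5}\]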
The seasonal part of the model can also be identified from the ACF/PACF plots.
For an \(ARIMA(0,0,1)(0,0,1)_{12}\) model:
a spike at lag 12 in the ACF but no other significant spikes;
exponential decay in the seasonal lags of the PACF plot.
For an \(ARIMA(1,0,0)(1,0,0)_{12}\) model:
a spike at lag 12 in the PACF but no other significant spikes;
exponential decay in the seasonal lags of the ACF plot.
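These signatures can be checked by simulation. A minimal sketch in base R: arima.sim() has no seasonal argument, but a seasonal MA(1) with m = 12 can be written as an MA(12) whose only non-zero coefficient is at lag 12 (the coefficient 0.8 below is an arbitrary illustrative choice).

```r
set.seed(42)
# Seasonal MA(1) with m = 12: an MA(12) polynomial that is zero
# everywhere except at lag 12.
y <- arima.sim(model = list(ma = c(rep(0, 11), 0.8)), n = 500)
acf(y)   # a single significant spike, at lag 12
pacf(y)  # exponentially decaying spikes at lags 12, 24, 36, ...
```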
Example: European Quarterly Retail Trade
library(fpp2)
## Loading required package: ggplot2
## Loading required package: forecast
## Loading required package: fma
## Loading required package: expsmooth
autoplot(euretail) + ylab("Retail Index") + xlab("Year") +
ggtitle("Quarterly Retail Trade Index")
The data are clearly non-stationary, with some seasonality present as well, so we first take a seasonal difference and see whether that makes the data stationary.
euretail %>% diff(lag = 4) %>% ggtsdisplay()
The data still look non-stationary, so we take a first difference as well.
euretail %>% diff(lag = 4) %>% diff(lag = 1) %>% ggtsdisplay()
Now we will try to find an appropriate ARIMA model based on the ACF/PACF plots. A significant spike at lag 1 suggests a non-seasonal MA(1) component, and a significant spike at lag 4 suggests a seasonal MA(1) component. Hence, we initially select an \(ARIMA(0,1,1)(0,1,1)_4\) model, indicating a seasonal and a first difference, plus non-seasonal and seasonal MA(1) terms.
euretail %>%
Arima(order = c(0,1,1), seasonal = c(0,1,1)) %>%
residuals() %>% ggtsdisplay()
Both the ACF and PACF plots now show a significant spike at lag 2, which means an additional non-seasonal term is needed. We now try an \(ARIMA(0,1,2)(0,1,1)_4\) model.
fit_1 = Arima(euretail, order = c(0,1,2), seasonal = c(0,1,1))
summary(fit_1)
## Series: euretail
## ARIMA(0,1,2)(0,1,1)[4]
##
## Coefficients:
## ma1 ma2 sma1
## 0.2303 0.2502 -0.6991
## s.e. 0.1484 0.1188 0.1284
##
## sigma^2 estimated as 0.1789: log likelihood=-32.76
## AIC=73.53 AICc=74.27 BIC=81.84
##
## Training set error measures:
## ME RMSE MAE MPE MAPE
## Training set -0.0448743 0.3956301 0.2816785 -0.04296557 0.2916949
## MASE ACF1
## Training set 0.2291311 0.06982161
fit_1 %>% residuals() %>% ggtsdisplay()
This time we get a spike at lag 3. Now let’s try \(ARIMA(0,1,3)(0,1,1)_4\).
fit_2 = Arima(euretail, order = c(0,1,3), seasonal = c(0,1,1))
summary(fit_2)
## Series: euretail
## ARIMA(0,1,3)(0,1,1)[4]
##
## Coefficients:
## ma1 ma2 ma3 sma1
## 0.2630 0.3694 0.4200 -0.6636
## s.e. 0.1237 0.1255 0.1294 0.1545
##
## sigma^2 estimated as 0.156: log likelihood=-28.63
## AIC=67.26 AICc=68.39 BIC=77.65
##
## Training set error measures:
## ME RMSE MAE MPE MAPE
## Training set -0.02965298 0.3661147 0.2787802 -0.02795377 0.2885545
## MASE ACF1
## Training set 0.2267735 0.006455781
fit_2 %>% residuals() %>% ggtsdisplay()
This time the ACF/PACF plots look much better, the residuals look like white noise, and the AICc has also improved.
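The AICc improvement can also be confirmed directly from the fitted objects (this uses fit_1 and fit_2 as defined above):

```r
# AICc values of the two fitted models (74.27 vs 68.39 in the
# summaries above); smaller is better.
c(fit_1 = fit_1$aicc, fit_2 = fit_2$aicc)
```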
fit_2 %>% forecast(h = 12) %>% autoplot()
The forecasts follow the recent trend in the data because of the double differencing.
The large and rapidly increasing prediction intervals show that the retail trade index could start increasing or decreasing at any time — while the point forecasts trend downwards, the prediction intervals allow for the data to trend upwards during the forecast period.
Now we can check whether the auto.arima function gives the same result.
auto.arima(euretail, stepwise = FALSE)
## Series: euretail
## ARIMA(0,1,3)(0,1,1)[4]
##
## Coefficients:
## ma1 ma2 ma3 sma1
## 0.2630 0.3694 0.4200 -0.6636
## s.e. 0.1237 0.1255 0.1294 0.1545
##
## sigma^2 estimated as 0.156: log likelihood=-28.63
## AIC=67.26 AICc=68.39 BIC=77.65
It selects the same model.
Example: Corticosteroid drug sales in Australia
lh02 = log(h02)
cbind("H02 sales (million scripts)" = h02,
"Log H02 sales" = lh02) %>%
autoplot(facets = TRUE) + xlab("Year") + ylab("")
The original time series shows a small increase in variance, so we take logs to stabilize it. The data are seasonal and clearly non-stationary, so seasonal differencing will be used.
lh02 %>% diff(lag = 12) %>%
ggtsdisplay(xlab = "Year",
main = "Seasonally differenced H02 scripts")
The seasonally differenced data are still not stationary. The PACF plot shows spikes at lags 12 and 24, but the ACF does not show any significant spikes at the seasonal lags; this suggests a seasonal AR(2) term. At the non-seasonal lags, the PACF shows three significant spikes and the ACF does not suggest any simple model, so we start with a non-seasonal AR(3). This gives \(ARIMA(3,0,0)(2,1,0)_{12}\).
AICC = c()
AICC[1] = Arima(h02, order = c(3,0,0), seasonal = c(2,1,0), lambda = 0)$aicc
AICC[2] = Arima(h02, order = c(3,0,1), seasonal = c(2,1,0), lambda = 0)$aicc
AICC[3] = Arima(h02, order = c(3,0,2), seasonal = c(2,1,0), lambda = 0)$aicc
AICC[4] = Arima(h02, order = c(3,0,1), seasonal = c(1,1,0), lambda = 0)$aicc
AICC[5] = Arima(h02, order = c(3,0,1), seasonal = c(0,1,2), lambda = 0)$aicc
AICC[6] = Arima(h02, order = c(3,0,1), seasonal = c(1,1,1), lambda = 0)$aicc
AICC[7] = Arima(h02, order = c(3,0,1), seasonal = c(0,1,1), lambda = 0)$aicc
AICC
## [1] -475.1230 -476.3086 -474.8826 -463.3951 -485.4754 -484.2511 -483.6674
Of these models, the best is the \(ARIMA(3,0,1)(0,1,2)_{12}\) model (it has the smallest AICc value).
fit = Arima(h02, order = c(3,0,1), seasonal = c(0,1,2), lambda = 0)
fit %>% checkresiduals()
##
## Ljung-Box test
##
## data: Residuals from ARIMA(3,0,1)(0,1,2)[12]
## Q* = 23.663, df = 18, p-value = 0.1664
##
## Model df: 6. Total lags used: 24
We still see some spikes in the ACF plot, although the model passes the Ljung-Box test (p-value = 0.17). Now we try the auto.arima function.
fit_2 = auto.arima(h02, lambda = 0, stepwise = FALSE)
summary(fit_2)
## Series: h02
## ARIMA(2,1,0)(0,1,1)[12]
## Box Cox transformation: lambda= 0
##
## Coefficients:
## ar1 ar2 sma1
## -0.8491 -0.4207 -0.6401
## s.e. 0.0712 0.0714 0.0694
##
## sigma^2 estimated as 0.004387: log likelihood=245.39
## AIC=-482.78 AICc=-482.56 BIC=-469.77
##
## Training set error measures:
## ME RMSE MAE MPE MAPE
## Training set -0.003580125 0.05092947 0.03715772 -0.5483892 4.731282
## MASE ACF1
## Training set 0.6129786 0.007272094
checkresiduals(fit_2)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(2,1,0)(0,1,1)[12]
## Q* = 31.132, df = 21, p-value = 0.07148
##
## Model df: 3. Total lags used: 24
The residuals look normally distributed, but the ACF plot still shows some spikes. Sometimes it is just not possible to find a model that passes all of the tests. One way forward in this situation is to evaluate the best candidate models on out-of-sample data and select the one with the highest forecast accuracy.
When models are compared using AICc values, it is important that all models have the same orders of differencing. However, when comparing models using a test set, it does not matter how the forecasts were produced — the comparisons are always valid.
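A sketch of such a test-set comparison for h02 (the split at mid-2006, which leaves the final two years as the test set, is an arbitrary choice here):

```r
library(forecast)  # Arima(), forecast(), accuracy(); h02 comes from fpp2
# Hold out the last two years of h02 as a test set.
train <- window(h02, end = c(2006, 6))
test  <- window(h02, start = c(2006, 7))
fit   <- Arima(train, order = c(3, 0, 1), seasonal = c(0, 1, 2), lambda = 0)
fc    <- forecast(fit, h = length(test))
# Test-set RMSE; repeat for each candidate model and pick the smallest.
accuracy(fc, test)["Test set", "RMSE"]
```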
Forecasts from the \(ARIMA(3,0,1)(0,1,2)_{12}\) model (which has the lowest RMSE value on the test set, and the best AICc value amongst models with only seasonal differencing) are shown below.
h02 %>%
Arima(order = c(3,0,1), seasonal = c(0,1,2), lambda = 0) %>%
forecast() %>%
autoplot() +
ylab("H02 sales (million scripts)") + xlab("Year")
Example: International tourists to Australia
cbind("No. of tourists" = austourists,
"No. of tourists (Box-Cox)" = BoxCox(austourists, lambda = BoxCox.lambda(austourists))) %>%
autoplot(facets = TRUE) + xlab("Years")
The data are seasonal in nature, and there is also a strong increasing trend.
ggAcf(austourists)
ggPacf(austourists)
The ACF plot shows a wave pattern within each season, which is caused by the seasonality in the data. The PACF plot shows significant spikes at lags 4 and 8, as well as some significant spikes at the early lags up to lag 5.
austourists %>% diff(lag = 4) %>% ggtsdisplay()
Both plots show a spike at lag 4 and no other significant spikes at the seasonal lags, suggesting a seasonal AR(1) with a seasonal difference, i.e. a seasonal part of \((1,1,0)_4\). There is also a spike at lag 1 in both plots, suggesting a non-seasonal AR(1). The model we fit is \(ARIMA(1,0,0)(1,1,0)_4\).
We can fit this model and check the residuals.
aust_arima = Arima(austourists, order = c(1,0,0), seasonal = c(1,1,0))
checkresiduals(aust_arima)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,0,0)(1,1,0)[4]
## Q* = 11.843, df = 6, p-value = 0.06556
##
## Model df: 2. Total lags used: 8
ggPacf(residuals(aust_arima))
The ACF and PACF plots still show some spikes at lag 1. We can check whether including a drift term gives a better result.
aust_arima_drift = Arima(austourists, order = c(1,0,0), seasonal = c(1,1,0), include.drift = TRUE)
checkresiduals(aust_arima_drift)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,0,0)(1,1,0)[4] with drift
## Q* = 4.0937, df = 5, p-value = 0.536
##
## Model df: 3. Total lags used: 8
This looks much better. Let’s see what model is selected from auto.arima.
auto.arima(austourists, stepwise = FALSE)
## Series: austourists
## ARIMA(1,0,0)(1,1,0)[4] with drift
##
## Coefficients:
## ar1 sar1 drift
## 0.4705 -0.5305 0.5489
## s.e. 0.1154 0.1122 0.0864
##
## sigma^2 estimated as 5.15: log likelihood=-142.48
## AIC=292.97 AICc=293.65 BIC=301.6
The auto.arima function also selects the same model.
Example: Total net generation of electricity (in billion kilowatt hours)
autoplot(usmelec, series = "Data") +
autolayer(ma(usmelec, order = 12, centre = TRUE), series = "2x12-MA") +
xlab("Year") + ggtitle("Total net generation of electricity")
## Warning: Removed 12 rows containing missing values (geom_path).
There is an increasing trend until around 2008, after which no trend is visible. The data also show increasing variance, so we apply a Box-Cox transformation.
usmelec_lambda = BoxCox.lambda(usmelec)
autoplot(BoxCox(usmelec, lambda = usmelec_lambda))
The data are still seasonal and non-stationary, so we take a seasonal difference.
usmelec %>% BoxCox(lambda = usmelec_lambda) %>% diff(lag = 12) %>% ggtsdisplay()
The data still do not look stationary. We can take a first difference and check again.
usmelec %>% BoxCox(lambda = usmelec_lambda) %>% diff(lag = 12) %>%
diff(lag = 1) %>%
ggtsdisplay()
The data now look fairly stationary, apart from some peaks and troughs.
The ACF plot shows a significant spike at lag 12 and no spikes at any multiple of 12, while the PACF shows slow decay at the seasonal lags. This suggests a seasonal MA(1), i.e. a seasonal part of \((0,1,1)_{12}\).
Both the ACF and PACF plots show spikes at lags 1 and 2, so we select a non-seasonal \(ARIMA(0,1,2)\).
usmelec_arima1 = Arima(usmelec, order = c(0,1,2), seasonal = c(0,1,1),
lambda = usmelec_lambda)
summary(usmelec_arima1)
## Series: usmelec
## ARIMA(0,1,2)(0,1,1)[12]
## Box Cox transformation: lambda= -0.5738331
##
## Coefficients:
## ma1 ma2 sma1
## -0.4317 -0.2552 -0.8536
## s.e. 0.0439 0.0440 0.0261
##
## sigma^2 estimated as 1.284e-06: log likelihood=2544.75
## AIC=-5081.51 AICc=-5081.42 BIC=-5064.87
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.4393834 7.255962 5.331941 -0.1768805 2.012093 0.5921258
## ACF1
## Training set -0.02547411
ggtsdisplay(residuals(usmelec_arima1))
auto.arima(usmelec, lambda = usmelec_lambda, stepwise = FALSE)
## Series: usmelec
## ARIMA(1,1,1)(2,1,1)[12]
## Box Cox transformation: lambda= -0.5738331
##
## Coefficients:
## ar1 ma1 sar1 sar2 sma1
## 0.3888 -0.8265 0.0403 -0.0958 -0.8471
## s.e. 0.0630 0.0375 0.0555 0.0531 0.0341
##
## sigma^2 estimated as 1.274e-06: log likelihood=2547.36
## AIC=-5082.72 AICc=-5082.54 BIC=-5057.76
The auto.arima model gives slightly better results (a slightly lower AICc).